The wine market is a dynamic, diversified sector of the economy where a number of factors affect how prices are formed. Historically, wine quality—which is frequently expressed in ratings—has been seen by both producers and customers as the main factor influencing pricing. But according to studies on consumer behavior and current market trends, wine pricing is significantly more intricate.
Vivino.com is a popular website where wine lovers buy quality wine. Its information on wine varietals, pricing, and ratings offers a rare chance to investigate the complex relationship between wine attributes and consumer views, and we found a dataset scraped from the site. Motivated by a common love of wine among our enthusiast community, we want to explore the connection between wine prices and grape varietals, regions of origin, and overall scores. To shed light on the variables that affect how valuable and high-quality wines are perceived, this study looks for possible trends, preferences, and subtleties in the wine industry. With this study, we hope to:
- Examine the relationship between wine prices and the kind of grapes included in the dataset.
- Investigate the ways in which the location of wines affects their cost and rating.
- Investigate whether higher-rated wines often fetch higher prices and the overall effect of wine ratings on their market worth.
Highest Rated Wines Across Categories: Which wine category in the dataset has the highest overall ratings? Within each category, which year produces the most welcoming (highest-rated) wine?
Most Welcoming Wine Valleys: Which wine valley is known for producing the most welcoming wines?
Regions with Expensive Wines: Which country produces the most expensive wines? Within countries, which regions produce the most expensive wines?
Price Analysis by Category in a Specific Region: How do wine prices vary across categories in a particular region? Which visualization best shows how the price distribution varies within regions?
Relationship Between Ratings and Prices: Which statistical analysis is appropriate for exploring the correlation between wine ratings and prices? Does the relationship follow a simple linear regression model or a more complex multiple linear regression model?
The nature of the relationship: Is it positive (higher prices are associated with higher ratings) or negative (higher prices are associated with lower ratings)?
We got our source data from https://www.kaggle.com/datasets/budnyak/wine-rating-and-price/code.
Originally, the source data contains five individual CSV files: Red.csv, Rose.csv, Sparkling.csv, Varieties.csv, and White.csv.
The Red.csv dataset has 8666 rows and 8 columns (8 variables).
The Sparkling.csv dataset has 1007 rows and 8 columns (8 variables).
The Varieties.csv dataset has 1512 rows and 1 column (1 variable).
The White.csv dataset has 3764 rows and 8 columns (8 variables).
The first rows of Red.csv are shown below as an example. We filtered out Varieties.csv because its content is not consistent with the other files. The remaining files share the same variables, so our goal is to merge them into one dataset.
head(red, 10) |>
knitr::kable(digits = 3) |>
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), font_size = 12) |>
kableExtra::scroll_box(width = "100%", height = "300px")
| Name | Country | Region | Winery | Rating | NumberOfRatings | Price | Year |
|---|---|---|---|---|---|---|---|
| Pomerol 2011 | France | Pomerol | Château La Providence | 4.2 | 100 | 95.00 | 2011 |
| Lirac 2017 | France | Lirac | Château Mont-Redon | 4.3 | 100 | 15.50 | 2017 |
| Erta e China Rosso di Toscana 2015 | Italy | Toscana | Renzo Masi | 3.9 | 100 | 7.45 | 2015 |
| Bardolino 2019 | Italy | Bardolino | Cavalchina | 3.5 | 100 | 8.72 | 2019 |
| Ried Scheibner Pinot Noir 2016 | Austria | Carnuntum | Markowitsch | 3.9 | 100 | 29.15 | 2016 |
| Gigondas (Nobles Terrasses) 2017 | France | Gigondas | Vieux Clocher | 3.7 | 100 | 19.90 | 2017 |
| Marion’s Vineyard Pinot Noir 2016 | New Zealand | Wairarapa | Schubert | 4.0 | 100 | 43.87 | 2016 |
| Red Blend 2014 | Chile | Itata Valley | Viña La Causa | 3.9 | 100 | 17.52 | 2014 |
| Chianti 2015 | Italy | Chianti | Castello Montaùto | 3.6 | 100 | 10.75 | 2015 |
| Tradition 2014 | France | Minervois | Domaine des Aires Hautes | 3.5 | 100 | 6.90 | 2014 |
Data Cleaning:
library(tidyverse)

# Read every CSV in ./data and keep each file path alongside its contents
categories = list.files(path = "./data", full.names = TRUE)
files_data = map(categories, read_csv)
nested_file = tibble(categories, files_data)
wine_rating = unnest(nested_file, cols = c(files_data))

# Clean column names, turn the file path into a category label, drop `variety`
wine_rating =
  wine_rating |>
  janitor::clean_names() |>
  mutate(categories = str_extract(categories, "[A-Z][a-z]+")) |>
  select(-variety)

write_csv(wine_rating, file = "wine_rating.csv")
When merging the four datasets, we tried two approaches: a left join, and
reading the files with the map function and combining the results. Both
produced a single dataset covering the four wine categories. However, the
left join dropped the category names and required more complex code, so we
concluded that the map approach was better suited to consolidating the data.
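The idea of tagging each file with its category and then stacking the files can be sketched in base R. This is a minimal illustration and not the code used in the report: the two small data frames stand in for the CSV files, and the rosé row is a made-up placeholder.

```r
# Hypothetical stand-ins for two of the per-category CSV files
red  <- data.frame(Name = c("Pomerol 2011", "Lirac 2017"), Price = c(95.00, 15.50))
rose <- data.frame(Name = "Example Rose 2019", Price = 9.90)  # placeholder row
files <- list(Red = red, Rose = rose)

# Add a `categories` column to each table, then stack the tables row-wise
tagged <- Map(function(df, category) cbind(categories = category, df),
              files, names(files))
combined <- do.call(rbind, tagged)
rownames(combined) <- NULL
combined
```

Because the category label travels with every row, nothing is lost when the tables are stacked, which is the property the left join approach lacked.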
head(wine_rating, 10) |>
knitr::kable(digits = 3) |>
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), font_size = 12) |>
kableExtra::scroll_box(width = "100%", height = "300px")
| categories | name | country | region | winery | rating | number_of_ratings | price | year |
|---|---|---|---|---|---|---|---|---|
| Red | Pomerol 2011 | France | Pomerol | Château La Providence | 4.2 | 100 | 95.00 | 2011 |
| Red | Lirac 2017 | France | Lirac | Château Mont-Redon | 4.3 | 100 | 15.50 | 2017 |
| Red | Erta e China Rosso di Toscana 2015 | Italy | Toscana | Renzo Masi | 3.9 | 100 | 7.45 | 2015 |
| Red | Bardolino 2019 | Italy | Bardolino | Cavalchina | 3.5 | 100 | 8.72 | 2019 |
| Red | Ried Scheibner Pinot Noir 2016 | Austria | Carnuntum | Markowitsch | 3.9 | 100 | 29.15 | 2016 |
| Red | Gigondas (Nobles Terrasses) 2017 | France | Gigondas | Vieux Clocher | 3.7 | 100 | 19.90 | 2017 |
| Red | Marion’s Vineyard Pinot Noir 2016 | New Zealand | Wairarapa | Schubert | 4.0 | 100 | 43.87 | 2016 |
| Red | Red Blend 2014 | Chile | Itata Valley | Viña La Causa | 3.9 | 100 | 17.52 | 2014 |
| Red | Chianti 2015 | Italy | Chianti | Castello Montaùto | 3.6 | 100 | 10.75 | 2015 |
| Red | Tradition 2014 | France | Minervois | Domaine des Aires Hautes | 3.5 | 100 | 6.90 | 2014 |
The resulting wine_rating dataset has 15346 rows and 9 columns (9
variables), described below:
categories: Type of wine (four categories)
name: Name of each wine
country: Country where the wine was produced
region: Region where the wine was produced
winery: Winery or vineyard that produced the wine
rating: Score each wine received from consumers
number_of_ratings: Number of ratings received
price: Selling price of each wine
year: Production year
We analyzed the dataset from two angles: wine price versus geographic region and wine rating versus geographic region. To visualize the global distribution of wine prices, we created a world map showing the average wine price of each country in the dataset. We then identified the 10 countries with the highest average price and show them on a world map as well.
1. World Map of Average Wine Price by Country:
price_country=
wine_rating |>
group_by(country)|>
summarise(avg_by_country = mean(price)) |>
filter(!is.na(avg_by_country))
world_map_data <-
price_country |>
mutate(text = paste(country, "<br>Avg Price: $", round(avg_by_country, 2)))
fig_country_all <- plot_ly(
data = world_map_data,
type = "choropleth",
locations = ~country,
locationmode = "country names",
z = ~avg_by_country,
text = ~text,
colorscale = "Viridis"
)
fig_country_all <-
fig_country_all|>
layout(
geo = list(
showframe = FALSE,
projection = list(type = 'mercator')
),
title = "Average Wine Price by Country"
)
fig_country_all
This world map visualizes average wine prices by country, offering an overview of the global wine market as captured in the dataset. Each country is shaded according to its average wine price, with darker shades indicating lower average prices and lighter shades indicating higher prices. Notably, the United Kingdom has the lightest shade, signifying the highest average wine price on the map, while Mexico has the darkest, indicating the lowest.
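The per-country averaging behind the map can be sketched in base R on a toy table (the rows are illustrative, not taken from the full dataset). One difference is worth noting: `aggregate()` with a formula drops rows with a missing price before averaging, whereas `mean(price)` without `na.rm = TRUE` returns `NA` for any group that contains a missing price, so such a country is removed entirely by the later `filter(!is.na(...))` step.

```r
# Toy data: one price is missing on purpose
wines <- data.frame(
  country = c("France", "France", "Italy", "Italy", "Chile"),
  price   = c(95.00, 15.50, 7.45, NA, 17.52)
)

# Average price per country; rows with NA price are dropped first
avg_price <- aggregate(price ~ country, data = wines, FUN = mean)

# Sort from most to least expensive
avg_price <- avg_price[order(-avg_price$price), ]
avg_price
```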
2. World Map of the 10 Most Expensive Countries:
price_country <-
price_country |>
arrange(desc(avg_by_country))
top_10_country =
head(price_country, 10) |>
mutate(rank = 1:10)
country10_map_data <- top_10_country |>
mutate(text = paste("Rank: ", rank, "<br>Country: ", country, "<br>Avg Price: $", round(avg_by_country, 2)))
fig_country_10map <- plot_ly(
data = country10_map_data,
type = "choropleth",
locations = ~country,
locationmode = "country names",
z = ~avg_by_country,
text = ~text,
colorscale = "YlGnBu"
)
fig_country_10map <-
fig_country_10map |>
layout(
geo = list(
showframe = FALSE,
projection = list(type = 'mercator')
),
title = "Top 10 Countries Based on Average Wine Price"
)
fig_country_10map
This map visually presents the top 10 countries with the highest average wine prices. The countries are ranked based on their average wine prices, with darker shades indicating lower prices. Each country is labeled with its rank, name, and the corresponding average price in US dollars. Based on this map, the top three most expensive countries are the UK, France, and the US.
3. Bar Plot of the 10 Most Expensive Regions:
price_region=
wine_rating |>
group_by(country,region)|>
summarise(avg_by_region = mean(price)) |>
filter(!is.na(avg_by_region))|>
arrange(desc(avg_by_region))
top_10_regions <- head(price_region, 10)
fig_region10 <- plot_ly(
data = top_10_regions,
type = "bar",
x = ~reorder(region, -avg_by_region),
y = ~avg_by_region,
color = ~country,
text = ~paste("Country: ", country, "<br>Avg Price: $", round(avg_by_region, 2)),
marker = list(size = 10)
)
fig_region10 <-
fig_region10 |>
layout(
title = "Top 10 Regions Based on Average Wine Price",
xaxis = list(title = "Region"),
yaxis = list(title = "Average Wine Price"),
showlegend = TRUE
)
fig_region10
This bar chart displays the top 10 regions with the highest average wine prices. Each bar represents a region, sorted in descending order of average price, and its height corresponds to the average wine price on the y-axis. Bars are color-coded by country, as identified in the legend. France prominently leads the list of the most expensive areas, with eight of the top 10 regions located there.
4. Price Distribution for the Top 10 Wineries:
price_winery=
wine_rating |>
group_by(country,region, winery)|>
summarise(avg_by_winery = mean(price)) |>
filter(!is.na(avg_by_winery))|>
arrange(desc(avg_by_winery))
price_winery_dis <- wine_rating %>%
group_by(country, region, winery) %>%
mutate(avg_by_winery = mean(price)) %>%
filter(!is.na(avg_by_winery)) %>%
select(country, region, winery, price, avg_by_winery) %>%
arrange(desc(avg_by_winery))
top_10_wineries <- price_winery_dis %>%
group_by(avg_by_winery) %>%
nest() %>%
arrange(desc(avg_by_winery)) %>%
head(10) %>%
unnest(cols = data)
# Create a Plotly boxplot with color-coded markers by region
boxplot_plotly <- plot_ly(
data = top_10_wineries,
type = "box",
x = ~winery,
y = ~price,
color = ~region, # Color by region
colors = "Set3" # Choose a color scale (Set3 provides clear distinctions)
)
# Customize layout
boxplot_plotly <- boxplot_plotly %>%
layout(
title = "Price Distribution for Top 10 Wineries",
xaxis = list(title = "Winery"),
yaxis = list(title = "Price"),
showlegend = TRUE # Display legend for region colors
)
# Display the Plotly plot
boxplot_plotly
This boxplot illustrates the price distribution of the top 10 wineries, with each box color-coded by region. Each box represents a winery, featuring a median line indicating the central price, box bounds denoting the interquartile range, and whiskers showing the price spread. However, some wineries appear as short lines, which might be due to limited data points or homogeneous prices in these wineries.
5. Wine Price in the United States:
price_us=
wine_rating |>
filter(country=="United States")|>
select(country,region, winery,price)
price_us_200=
price_us|>
filter(price<200)
# Create a histogram
histogram_plotly_us <- plot_ly(
data = price_us_200,
type = "histogram",
x = ~price,
nbinsx = 10, # Adjust the number of bins as needed
marker = list(color = "lightblue")
)
# Customize layout
histogram_plotly_us <- histogram_plotly_us |>
layout(
title = "Wine Price Distribution in the US (Filtered under 200)",
xaxis = list(title = "Price"),
yaxis = list(title = "Frequency")
)
# Display the histogram
histogram_plotly_us
This histogram illustrates the distribution of wine prices in the United States, with a specific focus on wines priced under $200. The x-axis represents the price range, while the y-axis denotes the frequency of wines falling within each price category. This visualization provides a quick overview of the frequency distribution of wine prices in the United States under the specified threshold. Notably, the histogram reveals that the majority of wines in the United States are concentrated in the price range of 0 to 19.99 dollars.
6. Relationship between Wine prices and rating for different categories:
# Summarize ratings by country, year, and category for wines with more than 2000 ratings
wine_summary <- wine_rating %>%
group_by(country, year, categories) %>%
filter(number_of_ratings > 2000) %>%
summarise(mean_rating = mean(rating),
sd_rating = sd(rating),
n = n()) %>%
ungroup()
# Remove missing values and duplicates, strip digits (the vintage year) from wine names,
# and drop characters that cannot be converted to UTF-8
wine_rating =
wine_rating |>
na.omit() |>
distinct() |>
mutate(name = gsub("\\d", "", name), year = as.numeric(year),
name = iconv(name, from = "", to = "UTF-8", sub = ""),
winery = iconv(winery, from = "", to = "UTF-8", sub = ""),
region = iconv(region, from = "", to = "UTF-8", sub = ""))
This scatter plot shows the relationship between wine prices and ratings across different categories.
# Plotly scatter plot of rating against price, colored by category
plot_ly(wine_rating, x = ~price, y = ~rating, type = 'scatter', mode = 'markers', color = ~categories) %>%
layout(title = "Wine rating versus price by category",
xaxis = list(title = "Price"),
yaxis = list(title = "Rating"))
In this graph, points are colored by category: red, rosé, sparkling, and white wines. The graph shows a wide range of prices within each wine category, and there does not appear to be a strong correlation between price and rating. Some high-rated wines are available at moderate prices, while some expensive wines have lower ratings. For red wine, however, there is a visible trend: higher-priced red wines tend to have higher ratings.
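One way to put a number on the visual impression from a scatter plot like this is a rank correlation computed separately for each category. The sketch below uses simulated data, not the wine dataset, and the choice of Spearman's coefficient is our assumption; it is a reasonable fit when prices are heavily right-skewed.

```r
set.seed(1)

# Simulated stand-in data: skewed prices and ratings loosely tied to log(price)
toy <- data.frame(
  categories = rep(c("Red", "White"), each = 50),
  price      = exp(rnorm(100, mean = 3, sd = 1))
)
toy$rating <- 3 + 0.25 * log(toy$price) + rnorm(100, sd = 0.2)

# Spearman rank correlation between price and rating within each category
by_cat <- split(toy, toy$categories)
rho <- sapply(by_cat, function(d) cor(d$price, d$rating, method = "spearman"))
rho
```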
7. Within countries, highest rating regions:
# Within countries, highest-rated regions
wine_rating_summary <- wine_rating %>%
group_by(country, region) %>%
filter(number_of_ratings > 2000) %>%
summarise(mean_rating = mean(rating),
sd_rating = sd(rating),
n = n()) %>%
ungroup() %>%
arrange(region, desc(mean_rating)) %>%
# top_n() without `wt` ranks by the last column (here n) and keeps ties
top_n(20)
# Plotly bar plot
plot_ly(wine_rating_summary, x = ~reorder(region, -mean_rating), y = ~mean_rating, type = 'bar', color = ~country) %>%
layout(title = "Average wine rating by region (Within countries)",
xaxis = list(title = "Region"),
yaxis = list(title = "Mean Rating"))
The graph includes only wines that received more than 2000 ratings. This bar chart compares the average wine rating by region within various countries, with regions ordered by their mean rating. The visualization indicates that certain regions consistently produce higher-rated wines, but it also reflects the diversity within countries. For instance, wines from regions like Napa Valley, Barolo, and Bordeaux may have higher average ratings compared to other regions. Italy by far tops the list of countries with the highest-rated wine-producing regions: of the 27 regions shown, almost half (13) are in Italy.
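When selecting a "top k" of regions, it matters which column the ranking uses. dplyr's `top_n()` without a `wt` argument ranks by the last column of the table, which in the summary above is `n`, and it keeps ties. The base R sketch below, on a made-up table, shows how ranking by mean rating and ranking by count can pick different regions.

```r
# Made-up ratings for three hypothetical regions
toy <- data.frame(
  region = c("A", "A", "B", "B", "B", "C"),
  rating = c(4.5, 4.3, 3.9, 4.0, 4.1, 4.2)
)

# Per-region mean rating and number of wines
means  <- tapply(toy$rating, toy$region, mean)
counts <- tapply(toy$rating, toy$region, length)
summary_tbl <- data.frame(
  region      = names(means),
  mean_rating = as.numeric(means),
  n           = as.integer(counts)
)

# Top 2 regions by mean rating versus top 2 by number of wines
top_by_rating <- head(summary_tbl[order(-summary_tbl$mean_rating), ], 2)
top_by_count  <- head(summary_tbl[order(-summary_tbl$n), ], 2)
top_by_rating$region  # "A" "C"
top_by_count$region   # "B" "A"
```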
8. Rating analysis by category in a specific region (e.g., Napa Valley):
Among wine regions, the ones most familiar to us are Napa Valley, Napa County, and California. To complement the broader analysis, we narrow the analysis down to the ratings of these individual regions and the wines they produce.
# Data Preparation
rating_analysis_data <- wine_rating |>
filter(region %in% c("Napa Valley", "Napa County", "California")) |>
filter(categories != "Rose") |>
group_by(region, categories) |>
summarise(mean_rating = mean(rating)) |>
spread(key = categories, value = mean_rating)
# Convert to matrix (required for heatmap)
rating_matrix <- as.matrix(rating_analysis_data[,-1])
rownames(rating_matrix) <- rating_analysis_data$region
# Plotly Heatmap
fig <- plot_ly(x = colnames(rating_matrix),
y = rownames(rating_matrix),
z = rating_matrix,
type = "heatmap",
colorscale = "Viridis") %>%
layout(title = 'Rating Analysis by Wine Category in Napa Valley, Napa County, and California',
xaxis = list(title = 'Wine Category'),
yaxis = list(title = 'Region'))
fig
This heatmap shows the average ratings of different wine categories within Napa Valley, Napa County, and California as a whole. The color intensity reflects the mean rating. For most of us, the most familiar of these regions are Napa County and Napa Valley. From the graph, the red wines from Napa Valley had the highest mean rating compared to the other two, just as the White category of wines from Napa County had the highest mean rating. For the white wine, the production from Napa Valley had the highest mean rating.
Building on the previous visualizations, we next analyze the most welcoming wine valleys.
9. Most Welcoming Wine Valleys:
# Filter data if needed
welcoming_valleys_data <- wine_rating |>
filter(number_of_ratings > 2000) |>
group_by(region, categories) |>
summarise(mean_rating = mean(rating),
number_of_ratings = number_of_ratings,
mean_price = mean(price)) |>
ungroup() |>
filter(mean_rating > 4.3, mean_price > 0) # Ensures that mean_price is greater than 0
# Create a Plotly Bubble Chart
fig <- plot_ly(data = welcoming_valleys_data,
x = ~mean_price,
y = ~mean_rating,
size = ~number_of_ratings,
color = ~region,
colorscale = "Viridis",
type = 'scatter',
mode = 'markers',
marker = list(sizemode = 'diameter', sizeref = 0.7, opacity = 0.4)) %>%
layout(
title = 'Most Welcoming Wine Valleys by Wine Category',
xaxis = list(title = 'Mean Price', range = c(0, max(welcoming_valleys_data$mean_price))),
yaxis = list(title = 'Mean Rating')
)
# Print the figure
fig
This bubble plot shows the relationship between mean price and mean rating, restricted to wines with more than 2000 ratings and to region and category groups whose mean rating is greater than 4.3. Regions with larger bubbles and higher placements on the y-axis (mean rating) could be interpreted as more “welcoming” due to their combination of high ratings and a significant number of ratings, suggesting popular approval. The mean price on the x-axis gives an additional dimension, indicating whether the quality comes with a higher cost. The region represented by the large orange bubble (‘Bolgheri Superiore’) seems to be the most “welcoming,” as it has a high mean rating and a significant number of reviews. The plot covers a wide range of mean prices, but a cluster of regions has mean ratings in the narrow range of approximately 4.3 to 4.6. Also, not all high-rated wines are expensive, and not all expensive wines are highly rated, indicating that price is not the sole determinant of quality as perceived by raters.
Then we focus on the association between rating and price within four categories.
9. Numbers within Wine Types:
all_wine_cate = wine_rating |>
filter(categories != "Varieties") |>
group_by(categories) |>
summarise(number = n())
all_wine_cate
## # A tibble: 4 × 2
## categories number
## <chr> <int>
## 1 Red 8666
## 2 Rose 397
## 3 Sparkling 1007
## 4 White 3764
The table lists the number of wine entries under the categories of Red, Rose, Sparkling, and White. The sample sizes are noticeably out of balance. With 8,666 points, red wine has the highest count, indicating a wider variety or higher popularity; rose wine has the lowest, at 397, suggesting a smaller market presence or less variety in offerings. White wines and sparkling wines have moderate counts of 3,764 and 1,007, respectively, indicating a reasonable but limited variety compared to red wines. This distribution may represent consumer preferences, market trends, or data collection techniques, and it suggests potential biases for statistical analysis because of the wide range of sample sizes.
10. Distribution of wine type versus price:
This section presents a box plot of wine prices for each category. Wine prices are plotted on the y-axis and wine categories on the x-axis, and each box is colored by category. We draw the following conclusions from the plot.
price_plot= wine_rating |>
plot_ly(y = ~price, x= ~categories, color = ~ categories, type = "box", colors = "viridis") |>
layout(yaxis = list(range = c(0, 1500),dtick = 50))
price_plot
Red Wines: The price of red wines varies greatly, with many outliers indicating that some bottles are far more expensive than others. The priciest red costs $3417, and there are plenty of high-end options above $1000. Still, with a median price of $18.2, red wine is typically affordable despite the premium selections.
Rosé Wines: The price distribution of rosé wines is more concentrated, with fewer expensive outliers, which suggests more consistent pricing for rosé wines.
Sparkling Wines: There are a few outliers in the moderate range of pricing for sparkling wines. Given that the median price of 19.45 is greater than that of red and rosé wines, it is possible that sparkling wines are typically seen as a more upscale choice.
White Wines: With a wide price range and a few outliers, white wines follow a pattern akin to that of red wines. However, the median price of 13.15 is less than that of red wines, suggesting that the typical price of white wines may be lower.
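The medians quoted above are robust to the extreme bottles that stretch the boxes, which is why they can sit far below the category means reported later in the t-tests. A small base R sketch with made-up prices illustrates the gap between the two summaries.

```r
# Made-up prices: one extreme bottle in the "Red" group
toy <- data.frame(
  categories = c("Red", "Red", "Red", "White", "White", "White"),
  price      = c(10, 18, 3000, 8, 13, 40)
)

# Median and mean price per category
medians <- tapply(toy$price, toy$categories, median)
means   <- tapply(toy$price, toy$categories, mean)
rbind(median = medians, mean = means)
```

In this toy table the single 3000 bottle pulls the Red mean above 1000 while the Red median stays at 18.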
11. Distribution of wine type versus rating:
Second, we make a box plot that, like the pricing plot, shows the distribution of ratings across several wine categories. The rating of the wines is shown on the y-axis in this case. Given that wine ratings usually range from 0 to 5, the y-axis range is set at 0 to 5. There are 0.2 intervals between each tick mark.
rating_plot= wine_rating |>
plot_ly(y = ~rating, x= ~categories, color = ~ categories, type = "box", colors = "viridis") |>
layout(yaxis = list(range = c(0, 5), dtick = 0.2))
rating_plot
The overall distributions of ratings for various wines don’t differ significantly. However, we may still draw some conclusions from the plot we created.
Red Wines: Red wines have the highest median rating and a fairly broad rating distribution. A few low outliers suggest that certain red wines are not as well regarded as others.
Rosé Wines: The interquartile range (IQR) of rosé wines is smaller, indicating a higher degree of consistency in evaluations.
Sparkling Wines: The narrow IQR and lack of notable outliers in sparkling wines suggest a uniform ranking throughout the category. The high median rating may indicate that quality is usually viewed favorably.
White Wines: White wines are distributed in a manner similar to that of red wines. The IQR of white wines is modest. Overall, the evaluations seem to indicate that white wines are well rated, with a few exceptions at the lower end.
1. T-test to compare mean price among four categories.
To compare the average prices of the wine categories in our dataset, we used pairwise Welch two-sample t-tests. The Welch t-test is a reliable option for real-world data because it does not assume that the two groups being compared have equal variances.
# Filter the data for the four categories
red_wines <- subset(wine_rating, categories == "Red")
white_wines <- subset(wine_rating, categories == "White")
rose_wines <- subset(wine_rating, categories == "Rose")
sparkling_wines <- subset(wine_rating, categories == "Sparkling")
# Perform T-tests & Print results
Comparing Red and White:
t_test_red_white <- t.test(red_wines$price, white_wines$price)
print(t_test_red_white)
##
## Welch Two Sample t-test
##
## data: red_wines$price and white_wines$price
## t = 17.771, df = 12155, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 16.48147 20.56802
## sample estimates:
## mean of x mean of y
## 39.14506 20.62032
A very significant difference in the mean prices of red and white wines is shown by the Welch Two Sample t-test, with a t-statistic of 17.771 and a p-value less than 2.2e-16. This test has 12155 degrees of freedom, which reflects a sizable sample size. The 95% confidence interval for the difference in means, from 16.48 to 20.57, lies entirely above zero, indicating that red wines cost more than white wines on average. The average price of red wines is around 39.15, while the average price of white wines is about 20.62. This price differential between the two groups may reflect their different market positioning, with red wines perhaps being seen as more desirable or premium in specific situations.
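For readers who want to see where the t-statistic and the fractional degrees of freedom come from, the sketch below reproduces the Welch computation by hand on simulated prices (not the wine data) and checks it against `t.test()`. The degrees of freedom use the Welch-Satterthwaite approximation.

```r
set.seed(42)

# Simulated, right-skewed "prices" for two groups with unequal variances
x <- rlnorm(200, meanlog = 3.2, sdlog = 1.0)
y <- rlnorm(120, meanlog = 2.8, sdlog = 0.7)

# Squared standard error of each group mean
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch t-statistic and Welch-Satterthwaite degrees of freedom
t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)
df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# t.test() defaults to the Welch test (var.equal = FALSE)
builtin <- t.test(x, y)
c(manual_t = t_stat, builtin_t = unname(builtin$statistic),
  manual_df = df, builtin_df = unname(builtin$parameter))
```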
Comparing Red and Rose:
t_test_red_rose <- t.test(red_wines$price, rose_wines$price)
print(t_test_red_rose)
##
## Welch Two Sample t-test
##
## data: red_wines$price and rose_wines$price
## t = 21.878, df = 1922.4, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 24.23314 29.00550
## sample estimates:
## mean of x mean of y
## 39.14506 12.52574
The Welch t-test results in a t-statistic of 21.878 with a p-value less than 2.2e-16 when red and rose wines are compared, suggesting an even more marked difference in mean pricing than there is between red and white wines. Here, the sample sizes and variances are reflected in the 1922.4 degrees of freedom. Red wines are much more costly than rose wines, as seen by the confidence interval, which spans from 24.23314 to 29.00550. The mean prices of red wines are 39.15, while those of rose wines are 12.53. This finding implies that there is a distinct market distinction between these groups, with red wines being priced more.
Comparing Red and Sparkling:
t_test_red_sparkling <- t.test(red_wines$price, sparkling_wines$price)
print(t_test_red_sparkling)
##
## Welch Two Sample t-test
##
## data: red_wines$price and sparkling_wines$price
## t = 2.4847, df = 1871.1, p-value = 0.01305
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.9147425 7.7685347
## sample estimates:
## mean of x mean of y
## 39.14506 34.80343
The t-statistic for the comparison of red and sparkling wines is 2.4847, with a p-value of 0.01305, suggesting a statistically significant difference, though a weaker one than in the earlier comparisons. The 95% confidence interval for the difference in means, from 0.91 to 7.77, lies much closer to zero, indicating only a modest difference in mean prices. The mean price of red wines is roughly 39.15, somewhat more than the mean price of sparkling wines, which is approximately 34.80. The smaller gap between the two groups may be a result of comparable production costs or overlapping market segments.
Comparing White and Rose:
t_test_white_rose <- t.test(white_wines$price, rose_wines$price)
print(t_test_white_rose)
##
## Welch Two Sample t-test
##
## data: white_wines$price and rose_wines$price
## t = 8.5229, df = 755.37, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 6.230128 9.959023
## sample estimates:
## mean of x mean of y
## 20.62032 12.52574
A considerable price difference between white and rose wines is revealed by the t-test, which has a t-statistic of 8.5229 and a p-value of less than 2.2e-16. With a mean price of 20.62, white wines are considerably more costly than rose wines, which have a mean price of 12.53. This is evident from the confidence interval, which spans from 6.230128 to 9.959023. This discrepancy suggests that these two wine varieties have different market positionings as well as potential variations in manufacturing processes or customer perceptions.
Comparing White and Sparkling:
t_test_white_sparkling <- t.test(white_wines$price, sparkling_wines$price)
print(t_test_white_sparkling)
##
## Welch Two Sample t-test
##
## data: white_wines$price and sparkling_wines$price
## t = -9.0158, df = 1245.1, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -17.26941 -11.09680
## sample estimates:
## mean of x mean of y
## 20.62032 34.80343
The Welch t-test reveals a t-statistic of -9.0158 and a p-value less than 2.2e-16 when comparing the mean prices of white and sparkling wines, suggesting a significant difference. According to the negative t-statistic, sparkling wines cost more than white wines. The average cost of sparkling wines is around 34.80, which is much more than the average cost of white wines, which is 20.62. This finding emphasizes sparkling wines’ premium positioning, which may be related to their connotation of luxury and festivities.
Comparing Rose and Sparkling:
t_test_rose_sparkling <- t.test(rose_wines$price, sparkling_wines$price)
print(t_test_rose_sparkling)
##
## Welch Two Sample t-test
##
## data: rose_wines$price and sparkling_wines$price
## t = -13.153, df = 1380.2, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -25.60013 -18.95523
## sample estimates:
## mean of x mean of y
## 12.52574 34.80343
With a t-statistic of -13.153 and a p-value smaller than 2.2e-16, the t-test comparing rose and sparkling wines reveals a very significant difference in their mean costs. Sparkling wines, with a mean price of 34.80, are significantly more costly than rose wines, with a mean price of 12.53, according to the negative t-statistic and the confidence interval spanning from -25.60013 to -18.95523. The notable disparity in price between these two wine categories might perhaps be attributed to variations in manufacturing complexity, brand positioning, or customer preferences.
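The six comparisons above can also be run in one call. The sketch below uses `pairwise.t.test()` from base R's stats package on simulated prices (not the wine data); `pool.sd = FALSE` keeps the separate-variance Welch tests used in this section. The Bonferroni adjustment for running several tests is our addition for illustration and was not applied in the analysis above.

```r
set.seed(7)

# Simulated prices for four categories with different spreads
toy <- data.frame(
  categories = rep(c("Red", "Rose", "Sparkling", "White"),
                   times = c(80, 30, 40, 60)),
  price = c(rlnorm(80, 3.3, 1.0), rlnorm(30, 2.4, 0.4),
            rlnorm(40, 3.2, 0.8), rlnorm(60, 2.8, 0.7))
)

# All pairwise Welch t-tests with Bonferroni-adjusted p-values
res <- pairwise.t.test(toy$price, toy$categories,
                       p.adjust.method = "bonferroni", pool.sd = FALSE)
res$p.value  # lower-triangular matrix; unused cells are NA
```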
Because rating does not appear to vary linearly with the raw price, we apply a log transformation to price before examining its relationship with rating.
2. Log transformed price versus rating:
ggplot_rp_tf = wine_rating |>
ggplot(aes(x = log(price), y = rating, color = year)) +
geom_point()
ggplotly(ggplot_rp_tf)
Having examined the scatter plot of wine rating versus wine price, we found that the distribution of points can hardly be described as linear. We therefore applied a logarithmic transformation to the price variable and plotted the scatter plot again. The result is more satisfactory: the points are more tightly clustered and form a roughly positive linear relationship. To build a more accurate linear model, we considered that other variables might also be involved, which points toward a multiple linear regression.
3. Fitting linear model to rating, price, and year(residual plot):
# Regress rating on log(price) and year, then print a tidy coefficient table
lm_rp = lm(rating ~ log(price) + year, data = wine_rating)
broom::tidy(lm_rp) |>
  knitr::kable(digits = 3)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -15.657 | 1.294 | -12.097 | 0 |
| log(price) | 0.268 | 0.002 | 113.488 | 0 |
| year | 0.009 | 0.001 | 14.516 | 0 |
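One way to read the log(price) estimate in the table, taking the fitted model at face value: because price enters on the natural-log scale, multiplying price by a factor k shifts the predicted rating by the estimate times log(k). The short calculation below uses only the reported estimate of 0.268; the variable names are introduced here for illustration.

```r
beta_log_price <- 0.268  # log(price) estimate reported in the table

# Predicted change in rating when price doubles, holding year fixed
delta_double <- beta_log_price * log(2)

# Predicted change in rating for a 10% price increase, holding year fixed
delta_10pct <- beta_log_price * log(1.10)

round(c(doubling = delta_double, ten_percent = delta_10pct), 3)
```

So a doubling of price corresponds to roughly 0.19 rating points under this model, and a 10% increase to about 0.03. This describes an association in the fitted model, not a causal effect.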
With these thoughts in mind, we fitted a multiple linear regression of wine rating on log-transformed wine price and the wine's production year. The results are shown in the table above, and they are somewhat tricky to interpret. The p-value for log-transformed price is nearly zero, meaning its association with the dependent variable is statistically significant. However, we observed that most of the years have a p-value much greater than the significance level (the table above, in which year enters as a single numeric term, does not show this). Further checks are needed to judge the true relationship between year and rating, so we plotted the residuals of this linear regression model.
# Residuals of lm_rp plotted against log(price)
resid_rp = wine_rating |>
  modelr::add_residuals(lm_rp) |>
  ggplot(aes(x = log(price), y = resid)) + geom_point()
ggplotly(resid_rp)
From the plot, it is clear that the model does not fit the data well, since the points are not evenly spread around zero: they become more concentrated as the log-transformed price increases. Even so, these results alone do not let us conclude that the model is wrong. To test the model further, we performed a simulation.
4. Bootstrapping the above model as a further test:
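The uneven spread described above can also be checked numerically. The sketch below is a minimal illustration on simulated data in which the noise is assumed to shrink as log price grows. It compares the residual standard deviation across quartile bins of the predictor; all names and parameter values are hypothetical.

```r
# Simulated data whose noise shrinks as log price grows (assumption made
# only to mimic the narrowing pattern described in the text)
set.seed(7)
n <- 1000
log_price <- runif(n, min = 1, max = 6)
rating <- 3 + 0.27 * log_price + rnorm(n, sd = 0.5 - 0.07 * log_price)

fit <- lm(rating ~ log_price)

# Split the predictor into quartile bins and measure residual spread in each
bins <- cut(log_price,
            breaks = quantile(log_price, probs = 0:4 / 4),
            include.lowest = TRUE)
resid_sd <- tapply(resid(fit), bins, sd)
round(resid_sd, 3)
```

If the residual spread were constant, the four values would be similar. A steady decline from the first bin to the last indicates non-constant variance, which makes the usual standard errors and p-values less trustworthy.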
# Refit the model on 1000 bootstrap resamples and plot the log(price) estimates
bootstrap_rp = wine_rating |>
  modelr::bootstrap(n = 1000) |>
  mutate(
    models = map(strap, \(df) lm(rating ~ log(price) + year, data = df)),
    results = map(models, broom::tidy)) |>
  select(results) |>
  unnest(results) |>
  filter(term == "log(price)") |>
  ggplot(aes(x = estimate)) + geom_density()
ggplotly(bootstrap_rp)
The bootstrap produced an approximately normal distribution of the estimated log(price) coefficient. The density curve is close to symmetric, with no visible skewness or shoulders, which suggests the estimate is stable across resamples and not driven by extreme outliers or other unusual data points. This provides supporting evidence for our linear regression model. Still, more tests are needed before the model can be considered reliable.
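Beyond the shape of the density curve, the bootstrap estimates can be summarized as a standard error and a percentile interval. The following base-R sketch shows the idea on simulated data; the data frame `wine_sim`, the resample count `B`, and the generating slope of 0.27 are assumptions made for this example, not values from the analysis.

```r
# Simulated stand-in for the wine data (hypothetical values)
set.seed(123)
n <- 300
wine_sim <- data.frame(
  price = exp(rnorm(n, mean = 3, sd = 0.8)),
  year  = sample(2000:2020, n, replace = TRUE)
)
wine_sim$rating <- 3 + 0.27 * log(wine_sim$price) + rnorm(n, sd = 0.2)

# Refit the model on B resamples drawn with replacement,
# keeping the log(price) coefficient from each fit
B <- 500
boot_slopes <- replicate(B, {
  idx <- sample(n, replace = TRUE)
  coef(lm(rating ~ log(price) + year, data = wine_sim[idx, ]))[["log(price)"]]
})

boot_se <- sd(boot_slopes)                          # bootstrap standard error
boot_ci <- quantile(boot_slopes, c(0.025, 0.975))   # 95% percentile interval
boot_se
boot_ci
```

A narrow interval that excludes zero would support a positive association between log price and rating, but it says nothing about whether the linear form or the constant-variance assumption is appropriate.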
In the process of exploring the relationships between wine ratings, regions, and prices, we have uncovered a series of intriguing and intricate findings, particularly in the challenging results of the statistical model analysis. We will now delve into each aspect in detail:
1. Ratings versus regions:
Based on the analysis of the relationship between ratings and regions, we found that for each region, regardless of the price, there are wines with high ratings. However, the overall trend of ratings shows that wines with high ratings are mainly distributed in Italy and the United States. The proportion of high-rated wines is somewhat larger in Italy. Among them, the Italian winery Bolgheri Sassicaia is also shown in the bubble graph as having the highest prices and ratings. In the United States, the areas with higher ratings are consistent with our findings, mainly located in Napa Valley, Napa County, California. In summary, the relationship between ratings and regions is evident in specific areas, but due to each person’s unique criteria for judging wines, every region has wines that are highly rated.
2. Price versus regions:
In terms of the relationship between price and region, the UK has the highest average wine prices, while Mexico tends to offer more budget-friendly options. At the regional level, France, Italy, and the US stand out as home to the most expensive wine-producing regions. For customers choosing wines from the world's priciest regions, France is an excellent option, as it has the largest number of expensive wine-producing areas. At the winery level, Pétrus, based in France, is the most expensive. Within the US, wine prices generally fall within the range of 0 to 20, making them reasonably affordable for the majority of consumers.
3. Statistical Model Analysis:
Bringing together the results of the statistical model analysis, we cannot readily conclude that the linear model we applied is the best model for our dataset, because different checks gave different results. Based on the residual plot and the p-values from the broom::tidy() table, it is hard to conclude that the model is reliable. The bootstrap simulation, on the other hand, suggests the linear model fits reasonably well: the distribution we obtained is approximately normal, indicating a stable coefficient estimate. In sum, the dataset needs improvement, or additional supporting datasets are needed, before the model can be examined further.
The dataset becomes small after we filter it under certain conditions. For example, in the analyses of the 10 most expensive wine countries and the most welcoming wines, the sample size of qualifying data is small, which may introduce bias.
Customer Selection: Value-conscious consumers may go toward red or rosé wines, which provide more reasonably priced selections with a greater variety of ratings.
Perception of Quality: According to the ratings, people believe red and sparkling wines to be of greater quality. This perception may be influenced by the way the wines are made, how they are regarded in the market, or by other factors that are not visible in the plots. Additionally, if you'd like to sample some premium alternatives, red wines will be the best options.